Dialectal Arabic Orthography-based Transcription

Authors

  • Mohamed Maamouri
  • David Graff
  • Hubert Jin
  • Christopher Cieri
  • Tim Buckwalter
Abstract

The present paper describes the experience gained at LDC in the collection and transcription of conversational dialectal Arabic. The paper covers the following: (a) Arabic language background; (b) objectives, principles, and methodological choices of dialectal Arabic transcription; (c) design features of LDC’s ‘Arabic MultiDialectal Transcription Tool’ (AMADAT) and metalanguage transcription issues; and finally (d) a summary description of the technical specifications, process, current results and issues of the EARS Levantine Arabic Conversational Telephone Speech Collection.

1.0 Introduction: Arabic Language Background

The Arabic language is a ‘linguistic continuum’ (Hymes, 1973) with two major poles: an Arabic Standard, the language of most written and formal spoken discourse, and a collection of related Arabic dialects, which are mainly spoken and which present significant phonological, morphological, syntactic, and lexical differences among themselves and when compared to the standard written forms. This situation, usually referred to as ‘diglossia’ (Ferguson, 1959), presents challenging problems for Arabic spoken language technologies, including corpus creation to support Speech-to-Text (STT) systems. The spoken Arabic dialects are not officially written and have no standardized writing, in spite of growing but still relatively small and not wholly conventionalized web activities. A significant amount of linguistic variation occurs and produces many variant forms which are difficult to identify and regroup.

1.1 Arabic Dialectal Variation

The diglossic situation described above represents a significant linguistic distance between all Arabic dialects and the ‘fusha,’ commonly identified as ‘Modern Standard Arabic’ (MSA), though the latter term does not cover all features of the former. This linguistic distance is characterized by substantial phonological, morphological, and lexical variation.

Arabic dialectal variation is significant not only between major dialects (e.g. Egyptian, Levantine, Gulf, or Maghrebi) but also between the regional variants of any major dialect (e.g. Northern and Southern Levantine) and even between the subdialects of any regional variant. Because significant sound change has occurred in all Arabic dialects, much of this complexity lies in the differences between the phonologies of the various dialects. In Levantine Arabic (LA), for instance, the sound /q/ is pronounced /q/ but also /’/, /g/ and /k/. In Egyptian Arabic, /?/ replaces /q/, with few lexical exceptions and not in all subdialects. In Sudanese Arabic, MSA /q/ is replaced by /g/ and sometimes the uvular [  ]. All of the above creates confusion, which needs to be addressed and taken into account in any dialectal transcription task.

1.2 Nature of the Dialectal Arabic Transcription

Similar resources

Conventional Orthography for Dialectal Arabic

Dialectal Arabic (DA) refers to the day-to-day vernaculars spoken in the Arab world. DA lives side-by-side with the official language, Modern Standard Arabic (MSA). DA differs from MSA on all levels of linguistic representation, from phonology and morphology to lexicon and syntax. Unlike MSA, DA has no standard orthography since there are no Arabic dialect academies, nor is there a large edited...

Dialectal Arabic Telephone Speech Corpus: Principles, Tool design, and Transcription Conventions

The present paper presents the experience gained at LDC in the collection and transcription of a corpus of conversational telephone speech in dialectal Arabic. The paper will cover the following: (a) Arabic language background; (b) objectives, principles, and methodological choices of dialectal Arabic transcription, (c) conceptualization and design features of LDC’s ‘Arabic Multi-Dialectal Tran...

Collecting Data for Automatic Speech Recognition Systems in Dialectal Arabic Using Games with a Purpose

Building Automatic Speech Recognition (ASR) systems for spoken languages usually suffers from the problem of limited available transcriptions. ASR systems require large speech corpora that contain speech and the corresponding transcriptions for training acoustic models. In this paper, we target Egyptian dialectal Arabic. Like other spoken languages, it is mainl...

A Conventional Orthography for Tunisian Arabic

Tunisian Arabic is a dialect of the Arabic language spoken in Tunisia. Tunisian Arabic is an under-resourced language. It has neither a standard orthography nor large collections of written text and dictionaries. Actually, there is no strict separation between Modern Standard Arabic, the official language of the government, media and education, and Tunisian Arabic; the two exist on a continuum ...

Creating Resources for Dialectal Arabic from a Single Annotation: A Case Study on Egyptian and Levantine

Arabic dialects present a special problem for natural language processing because there are few Arabic dialect resources, they have no standard orthography, and they have not been studied much. However, as more and more written dialectal Arabic is found on social media, natural language processing for Arabic dialects has become an important goal. We present a methodology for creating a morpholo...

Publication date: 2013